May 22, 2018

Tēnā koutou

Overview

  • What is Data Science?
    • Common Topics
    • Tools I use
  • Intro to Machine Learning
    • Types of learning
    • Practical ML
  • Getting started with Python
  • Q & A

What is Data Science?

Data Science is:

From Wikipedia:

"Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms"

Data science can cover any or all of the following:

  • Data visualisation
  • Statistical modelling
  • Machine learning
  • Automation of data-related processes
  • Big data/Distributed computing

Multidisciplinary

Technical

Tools for Data Science

Languages

  • The most popular programming languages for data science are Python and R.
  • Some flavour of SQL is also often necessary
  • For big data, Scala is popular (but Python/R can do big data too)
  • Learn Git! GitHub is also awesome

If you want a recommendation, I'd say Python (but this is a matter of opinion).

In Python

Machine Learning

What is machine learning?

From Wikipedia (edited):

Machine learning is a sub-field of computer science and statistics which aims to give computers the ability to progressively improve performance on a specific task with data, without being explicitly programmed.

When it works, the idea is:

  • Collect a lot of data about a problem
  • Point a pattern-finding algorithm at the data
  • …?
  • Profit

How does a machine learn?

Supervised learning

Unsupervised learning

Reinforcement learning

Supervised Learning

Setting the problem

To get off the ground with a Machine Learning problem, you need:

  • A lot of training data (i.e. examples)
  • Something you want to predict (target values)
  • An algorithm to predict the target values, from the example data

Machine learning works by iterating through the training data, predicting the targets with a model, and improving the model by testing its accuracy as you go along.

Train → Predict → Improve
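The loop above can be sketched in a few lines. This is only an illustration, assuming scikit-learn is installed and using its built-in iris dataset as stand-in data (the dataset, the list of depths tried, and the variable names are all choices made for this sketch). Here "improve" simply means trying a few model settings and keeping the one that scores best on held-back examples.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small built-in example dataset (an assumption; any labelled data would do)
X, y = load_iris(return_X_y=True)

# Hold back some examples so accuracy is measured on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

best_depth, best_accuracy = None, 0.0
for depth in [1, 2, 3, 4]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)                      # Train
    predictions = model.predict(X_test)              # Predict
    accuracy = accuracy_score(y_test, predictions)   # Test the accuracy
    if accuracy > best_accuracy:                     # Improve: keep the best setting
        best_depth, best_accuracy = depth, accuracy

print("best max_depth:", best_depth, "accuracy:", round(best_accuracy, 3))
```

In a real project you would usually keep a third, untouched set of examples for the final accuracy check, since the held-back set here is also being used to pick the setting.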

Classification or Regression

There are many different kinds of models, but two important classes of ML problems are classification and regression.
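Roughly speaking, classification predicts a category and regression predicts a number. A small sketch of the difference, assuming scikit-learn; the fruit weights echo the example used later in these slides, while the prices are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical toy data: one feature (weight in grams) per example
weights = [[130], [140], [150], [170]]

# Classification: the target is a category (0 = apple, 1 = orange)
kinds = [0, 0, 1, 1]
classifier = DecisionTreeClassifier(random_state=0).fit(weights, kinds)
predicted_kind = classifier.predict([[160]])[0]

# Regression: the target is a continuous number (made-up prices in dollars)
prices = [0.50, 0.55, 0.80, 0.90]
regressor = DecisionTreeRegressor(random_state=0).fit(weights, prices)
predicted_price = regressor.predict([[160]])[0]

print("predicted kind:", predicted_kind)
print("predicted price:", predicted_price)
```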

Getting started

Imagine you wanted to train a classifier to distinguish apples 🍏 from oranges 🍊, based on the weight and texture of the fruit.

weight  texture  label
150g    Bumpy    Orange
170g    Bumpy    Orange
140g    Smooth   Apple
130g    Smooth   Apple

Featurisation

The first thing you have to do is find a way to encode the weight, texture and label of the fruit using numbers. One example of how to do this is given below:

weight  texture  label  texture_name  label_name
150     0        1      Bumpy         Orange
170     0        1      Bumpy         Orange
140     1        0      Smooth        Apple
130     1        0      Smooth        Apple


Question: is there another way we could have translated these into numbers?
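As a sketch of what this encoding step might look like in plain Python (the dictionary and variable names are made up for illustration, and the convention assumed is Bumpy = 0, Smooth = 1, Apple = 0, Orange = 1). The last part shows one possible answer to the question: "one-hot" encoding, which gives each texture its own 0/1 column instead of a single numeric code.

```python
# Raw fruit data: (weight in grams, texture, label)
fruit = [
    (150, "Bumpy", "Orange"),
    (170, "Bumpy", "Orange"),
    (140, "Smooth", "Apple"),
    (130, "Smooth", "Apple"),
]

# Lookup tables for the assumed convention
texture_codes = {"Bumpy": 0, "Smooth": 1}
label_codes = {"Apple": 0, "Orange": 1}

features = [[weight, texture_codes[texture]] for weight, texture, _ in fruit]
labels = [label_codes[label] for _, _, label in fruit]

# One alternative: one-hot encoding, with one 0/1 column per texture
one_hot_features = [
    [weight, int(texture == "Bumpy"), int(texture == "Smooth")]
    for weight, texture, _ in fruit
]

print(features)
print(labels)
print(one_hot_features)
```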

Decision Trees

Decision trees are a basic algorithm for classification. Here is an example of how to train a simple classifier using the scikit-learn package:

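from sklearn import tree

# Each example is [weight in grams, texture], where texture 0 = bumpy, 1 = smooth
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
# Labels: 0 = apple, 1 = orange
labels = [0, 0, 1, 1]

# Train a decision tree on the examples
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# Predict the label for a new 160g bumpy fruit
print("result is:", clf.predict([[160, 0]]))
# result is: [1]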
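
To see what the classifier actually learned, scikit-learn can print a trained tree as plain if/else rules. A short sketch, assuming a reasonably recent scikit-learn (export_text was added in version 0.21); the feature names "weight" and "texture" are labels chosen here for readability:

```python
from sklearn import tree

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(features, labels)

# Print the learned if/else rules; the feature names are labels chosen here
rules = tree.export_text(clf, feature_names=["weight", "texture"])
print(rules)
```

In this tiny dataset either weight or texture alone separates apples from oranges perfectly, so the tree may split on either one.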